Objectives

In this project, we will do exploratory data analysis on a bank’s data set, which was obtained from https://archive.ics.uci.edu/ml/datasets/bank+marketing, and try to gain insights regarding the distribution of the dataset through the implementation of data visualization.

knitr::include_graphics("assets/bank.jpeg")

Libraries and Setup

In this section, we will import the necessary libraries for this project.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ stringr 1.4.0
## ✓ tidyr   1.1.2     ✓ forcats 0.5.0
## ✓ readr   1.4.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(glue)
## 
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
## 
##     collapse
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Data Reading

Here, we will read the csv data file into our IDE, do further inspection regarding our data, and perform suitable data cleansing for further processing.

bank <- read.csv("bank/bank.csv", sep = ";")

Data pre-processing

Here we will perform an integral part of preparing our data into an understandable and complete format for the next step of machine learning.

Data Inspection

glimpse(bank)
## Rows: 4,521
## Columns: 17
## $ age       <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31,…
## $ job       <chr> "unemployed", "services", "management", "management", "blue…
## $ marital   <chr> "married", "married", "single", "married", "married", "sing…
## $ education <chr> "primary", "secondary", "tertiary", "tertiary", "secondary"…
## $ default   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…
## $ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 2…
## $ housing   <chr> "no", "yes", "yes", "yes", "yes", "no", "yes", "yes", "yes"…
## $ loan      <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "no", "ye…
## $ contact   <chr> "cellular", "cellular", "cellular", "unknown", "unknown", "…
## $ day       <int> 19, 11, 16, 3, 5, 23, 14, 6, 14, 17, 20, 17, 13, 30, 29, 29…
## $ month     <chr> "oct", "may", "apr", "jun", "may", "feb", "may", "may", "ma…
## $ duration  <int> 79, 220, 185, 199, 226, 141, 341, 151, 57, 313, 273, 113, 3…
## $ campaign  <int> 1, 1, 1, 4, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 5, 1, 1, 1,…
## $ pdays     <int> -1, 339, 330, -1, -1, 176, 330, -1, -1, 147, -1, -1, -1, -1…
## $ previous  <int> 0, 4, 1, 0, 0, 3, 2, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 1,…
## $ poutcome  <chr> "unknown", "failure", "failure", "unknown", "unknown", "fai…
## $ y         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no",…

Input variables:

  1. age (numeric)
  2. job : type of job (categorical: “admin.”, “unknown”, “unemployed”, “management”, “housemaid”, “entrepreneur”, “student”, “blue-collar”, “self-employed”, “retired”, “technician”, “services”)
  3. marital : marital status (categorical: “married”,“divorced”,“single”; note: “divorced” means divorced or widowed)
  4. education (categorical: “unknown”,“secondary”,“primary”,“tertiary”)
  5. default: has credit in default? (binary: “yes”,“no”)
  6. balance: average yearly balance, in euros (numeric)
  7. housing: has housing loan? (binary: “yes”,“no”)
  8. loan: has personal loan? (binary: “yes”,“no”)
  9. contact: contact communication type (categorical: “unknown”,“telephone”,“cellular”)
  10. day: last contact day of the month (numeric)
  11. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  12. duration: last contact duration, in seconds (numeric)
  13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
  15. previous: number of contacts performed before this campaign and for this client (numeric)
  16. poutcome: outcome of the previous marketing campaign (categorical: “unknown”,“other”,“failure”,“success”)
bank %>% head()
bank %>% tail()
summary(bank)
##       age            job              marital           education        
##  Min.   :19.00   Length:4521        Length:4521        Length:4521       
##  1st Qu.:33.00   Class :character   Class :character   Class :character  
##  Median :39.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :41.17                                                           
##  3rd Qu.:49.00                                                           
##  Max.   :87.00                                                           
##    default             balance        housing              loan          
##  Length:4521        Min.   :-3313   Length:4521        Length:4521       
##  Class :character   1st Qu.:   69   Class :character   Class :character  
##  Mode  :character   Median :  444   Mode  :character   Mode  :character  
##                     Mean   : 1423                                        
##                     3rd Qu.: 1480                                        
##                     Max.   :71188                                        
##    contact               day           month              duration   
##  Length:4521        Min.   : 1.00   Length:4521        Min.   :   4  
##  Class :character   1st Qu.: 9.00   Class :character   1st Qu.: 104  
##  Mode  :character   Median :16.00   Mode  :character   Median : 185  
##                     Mean   :15.92                      Mean   : 264  
##                     3rd Qu.:21.00                      3rd Qu.: 329  
##                     Max.   :31.00                      Max.   :3025  
##     campaign          pdays           previous         poutcome        
##  Min.   : 1.000   Min.   : -1.00   Min.   : 0.0000   Length:4521       
##  1st Qu.: 1.000   1st Qu.: -1.00   1st Qu.: 0.0000   Class :character  
##  Median : 2.000   Median : -1.00   Median : 0.0000   Mode  :character  
##  Mean   : 2.794   Mean   : 39.77   Mean   : 0.5426                     
##  3rd Qu.: 3.000   3rd Qu.: -1.00   3rd Qu.: 0.0000                     
##  Max.   :50.000   Max.   :871.00   Max.   :25.0000                     
##       y            
##  Length:4521       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
colSums(is.na(bank))
##       age       job   marital education   default   balance   housing      loan 
##         0         0         0         0         0         0         0         0 
##   contact       day     month  duration  campaign     pdays  previous  poutcome 
##         0         0         0         0         0         0         0         0 
##         y 
##         0

Data Cleansing

In this section of the report, we will make necessary adjustments to the data set in order to make it feasible for further processing and data exploration.

As can be seen from the data description above which was provided by the source, the columns c(“job”, “marital”, “education”, “default”, “housing”, “loan”, “contact”, “month”, “poutcome”, “y”) should have had categorical values. However, in the above inspection by using the glimpse function, it can be seen that they are still character data types. Hence, below they are transformed into the form of factors.

bank <- bank %>% 
  mutate_at(c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome", "y"), as.factor)
str(bank)
## 'data.frame':    4521 obs. of  17 variables:
##  $ age      : int  30 33 35 30 59 35 36 39 41 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 11 8 5 5 2 5 7 10 3 8 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 2 3 2 2 3 2 2 2 2 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 1 2 3 3 2 3 3 2 3 1 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ balance  : int  1787 4789 1350 1476 0 747 307 147 221 -88 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 1 2 2 2 2 1 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 2 1 2 1 1 1 1 1 2 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 1 1 1 3 3 1 1 1 3 1 ...
##  $ day      : int  19 11 16 3 5 23 14 6 14 17 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 11 9 1 7 9 4 9 9 9 1 ...
##  $ duration : int  79 220 185 199 226 141 341 151 57 313 ...
##  $ campaign : int  1 1 1 4 1 2 1 2 2 1 ...
##  $ pdays    : int  -1 339 330 -1 -1 176 330 -1 -1 147 ...
##  $ previous : int  0 4 1 0 0 3 2 0 0 2 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 1 1 4 4 1 2 4 4 1 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Below, we iterate through each columns to check the levels of each categorical variables for every factor data types in our data set.

for (column in names(bank)) {
  if (is.factor(bank[,column])) {
    print(paste(column, ":", sep = " "))
    print(levels(bank[,column]))
  }
}
## [1] "job :"
##  [1] "admin."        "blue-collar"   "entrepreneur"  "housemaid"    
##  [5] "management"    "retired"       "self-employed" "services"     
##  [9] "student"       "technician"    "unemployed"    "unknown"      
## [1] "marital :"
## [1] "divorced" "married"  "single"  
## [1] "education :"
## [1] "primary"   "secondary" "tertiary"  "unknown"  
## [1] "default :"
## [1] "no"  "yes"
## [1] "housing :"
## [1] "no"  "yes"
## [1] "loan :"
## [1] "no"  "yes"
## [1] "contact :"
## [1] "cellular"  "telephone" "unknown"  
## [1] "month :"
##  [1] "apr" "aug" "dec" "feb" "jan" "jul" "jun" "mar" "may" "nov" "oct" "sep"
## [1] "poutcome :"
## [1] "failure" "other"   "success" "unknown"
## [1] "y :"
## [1] "no"  "yes"

EDA (Exploratory Data Analysis)

In this section, we will be performing concepts of feature engineering and data visualization in effort to successfully demonstrate exploratory data analysis concepts towards the audience.

Feature Engineering

Here, we select only a small group of columns that will be needed to produce the boxplot of job - balance distribution across our dataframe. Here, we perform a label encoding to distinguish the different jobs using a new variable named job_id to enhance the aesthetics in our plotting by preventing colliding jobs in the x-axis of the label.

balance_job_agg <- bank %>% 
  mutate(job_id = factor(as.numeric(bank$job))) %>% 
  select(c("balance", "job", "job_id"))
balance_job_agg %>% head()

With the code chunk below, we subset the necessary columns with some adjustments for plotting job_id against number of acceptance of the product.

yes_job_agg <- bank %>% 
  mutate(job_id = factor(as.numeric(bank$job))) %>% 
  filter(y == "yes") %>% 
  select(c("job", "job_id", "y")) %>% 
  group_by(job_id) %>% 
  count(y) %>% 
  mutate(tooltip = glue("Total Acceptance: {(n)}"))
yes_job_agg %>% head()
no_job_agg <- bank %>% 
  mutate(job_id = factor(as.numeric(bank$job))) %>% 
  filter(y == "no") %>% 
  select(c("job", "job_id", "y")) %>% 
  group_by(job_id) %>% 
  count(y) %>% 
  mutate(tooltip = glue("Total Rejections: {(n)}"))
no_job_agg %>% head()
yes_job_agg$ratio  <- yes_job_agg$n/no_job_agg$n
acceptance_rate <- bank %>% 
  mutate(job_id = factor(as.numeric(bank$job))) %>% 
  select(c("job", "job_id")) %>% 
  group_by(job_id) %>% 
  inner_join(yes_job_agg) %>% 
  select(c("job_id", "job", "ratio")) %>% 
  mutate(tooltip = glue("Ratio: {(ratio)}%")) %>% 
  arrange(job_id)
## Joining, by = "job_id"
acceptance_rate <- acceptance_rate %>% unique()
acceptance_rate %>% head()
yes_count <- bank %>% 
  group_by(marital) %>% 
  filter(y=="yes") %>% 
  count(y)
no_count <- bank %>% 
  group_by(marital) %>% 
  filter(y=="no") %>% 
  count(y)
marital_y_agg <- rbind(yes_count, no_count)
marital_y_agg

Data Visualization

The following is the label encoding that will be used in visualizing the distributions involving job_id.

balance_job_agg %>% select(c("job", "job_id")) %>% distinct() %>% arrange(job_id)

A visualization using boxplot towards the distribution of balance among different jobs. From the below interpretation, we can notice the presence of many outliers that might skew the distribution upwards. However, there are also some extreme negative outliers that can be found in our dataframe, namely in the entrepreneur and self-employed sectors.

ggplot(balance_job_agg, aes(x = job_id, y = balance)) +
   geom_boxplot(aes(fill = balance)) +
   labs(title = "Balance According to Job", x= "job_id", y= "Balance",
        subtitle = "Avg. price indicated by the dotted red line") +
   theme(plot.title = element_text(hjust = 0.5)) +
   geom_hline(yintercept = mean(balance_job_agg$balance), color = "red", linetype = 5)

A plotting using geom_col to view the product acceptance distribution among different kind of jobs. It seems that the product is most well-perceived by the people in management field as reflected on the graph.

p <- ggplot(yes_job_agg, aes(x = job_id, y= n, fill = n, text = tooltip)) +
  geom_col(position = "identity") + 
  scale_fill_gradient(low = "#e4333e", high = "#52171a") +
  labs(title = "Product Acceptance Distribution Among Jobs", x ="job_id", y= "amount")+
  theme_minimal()
ggplotly(p, tooltip = c('text'))

The following is the product rejection distribution among different jobs as opposed to the above shown graph.

p <- ggplot(no_job_agg, aes(x = job_id, y= n, fill = n, text = tooltip)) +
  geom_col(position = "identity") + 
  scale_fill_gradient(low = "#69eeee", high = "#157373") +
  labs(title = "Product Rejection Distribution Among Jobs", x ="job_id", y= "amount")+
  theme_minimal()
ggplotly(p, tooltip = c('text'))
acc_p <- ggplot(acceptance_rate, aes(x = job_id, y = ratio, fill = ratio, text = tooltip)) +
  geom_col() +
  facet_grid( scales = "free_y")+
  geom_point(aes(col=ratio))+
  labs(title = "Acceptance Rate Among Each Job", x="job_id", y= "Ratio")+
  coord_flip()
ggplotly(acc_p, tooltip = c("text"))

The following graph attempts to illustrate the same product acceptance distributions on various marital status.

ggplot(marital_y_agg,aes(x = marital,y = n))+
   geom_col(aes(fill = y), position = "dodge") +
   labs (title = "Product Acceptance Distribution Among Marital Groups", x = "Marital Status", y = "Amount of Acceptance", fill ="Acceptance")+
   theme(plot.title = element_text(hjust = 0.5))

Conclusion

Based on the above visualizations, it seems that targeting single, retired persons might be the most beneficial for the company in terms of obtaining customers for their product. The chart which shows that people in management position being the category with the most acceptance towards the bank’s product can be inferred to be skewed due to the high amount of population within the sector as we can see the acceptance ratio is only about 1.6.